
(CVPR 2018) Cascaded Pyramid Network for Multi-Person Pose Estimation

Keyword [CPN]

Chen Y, Wang Z, Peng Y, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7103-7112.



1. Overview


Challenging cases in multi-person pose estimation include:

  • occluded keypoints
  • invisible keypoints
  • complex backgrounds

The paper proposes the Cascaded Pyramid Network (CPN):

  • GlobalNet. Localizes the simple keypoints
  • RefineNet. Explicitly handles the hard keypoints (online hard keypoint mining)
  • Top-down pipeline. A detector first generates human boxes, then keypoints are estimated within each box


1.1. Contribution

  • Propose CPN, which combines GlobalNet and RefineNet
  • Explore the effects of various factors in the top-down pipeline

1.2. Related Work

  • Classical. pictorial structure, graphical model, tree structure and hand-crafted feature
  • Multi-Person. top-down and bottom-up
  • Single-Person. regressors, heatmap and score map
  • Human Detection. one stage and two stages

1.3. Dataset & Metrics

  • MS COCO. trainval (57k images and 150k person instances), minival (5k images), test-dev (20k) and test-challenge (20k).
  • OKS-based mAP. OKS (object keypoint similarity) plays the role for poses that IoU plays for boxes in object detection
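
To make the metric concrete, here is a minimal sketch of OKS for a single predicted pose, assuming the COCO-style form: each labeled keypoint contributes exp(-d^2 / (2 * s^2 * k^2)), where d is the pixel distance to the ground truth, s^2 is the object area, and k is a per-keypoint constant. The function name `oks` and its argument layout are illustrative, not taken from the paper or from the COCO toolkit.

```python
import math

def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : list of flags, > 0 where the ground-truth keypoint is labeled
    area     : ground-truth object area (the squared scale s**2)
    k        : per-keypoint falloff constants (hypothetical values in the usage below)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    # Average over labeled keypoints; 0.0 if nothing is labeled
    return total / count if count else 0.0

# A perfect prediction scores 1.0; the score decays as keypoints drift.
print(oks([(0, 0), (5, 5)], [(0, 0), (5, 5)], [1, 1], 100.0, [0.1, 0.1]))  # 1.0
```

mAP is then obtained by thresholding OKS at several values, in the same way box AP thresholds IoU.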



2. Architecture





2.1. GlobalNet

  • Feature pyramid (top-down pathway) over the backbone feature maps C2, C3, C4, C5.
    • C2, C3. High spatial resolution for localization but low semantic information for recognition
    • C4, C5. More semantic information but low spatial resolution
  • Drawback: hard keypoints require more context than the nearby appearance features provide, so GlobalNet alone struggles to localize them
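
The resolution trade-off above can be made concrete with a little arithmetic. This sketch assumes the standard ResNet strides of 4, 8, 16 and 32 for C2 to C5 and the 256x192 input used in the experiments; the helper name `pyramid_sizes` is illustrative.

```python
def pyramid_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size (height, width) of C2..C5, assuming standard ResNet strides."""
    return {f"C{i + 2}": (height // s, width // s) for i, s in enumerate(strides)}

print(pyramid_sizes(256, 192))
# {'C2': (64, 48), 'C3': (32, 24), 'C4': (16, 12), 'C5': (8, 6)}
```

C5 keeps only an 8x6 grid, which is why the deeper maps carry semantics but cannot localize precisely on their own.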

2.2. RefineNet

  • Stack more bottleneck blocks on the deeper layers, whose feature maps are spatially smaller
  • Explicitly select the hard keypoints online (the top M by training loss) and back-propagate the loss from only those keypoints
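
The selection step can be sketched as follows, using plain floats in place of per-keypoint loss tensors. The function names are hypothetical, and averaging the selected losses (rather than summing them) is an assumption of this sketch; in a real training loop the same top-M selection would be applied to tensors so that gradients flow only through the selected keypoints.

```python
def select_hard_keypoints(losses, top_m=8):
    """Indices of the top_m keypoints with the largest loss, hardest first."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:top_m]

def ohkm_loss(losses, top_m=8):
    """Loss over the hard keypoints only; the easier keypoints are ignored.

    Assumes a non-empty list of per-keypoint losses.
    """
    hard = select_hard_keypoints(losses, top_m)
    return sum(losses[i] for i in hard) / len(hard)

per_keypoint = [0.1, 0.9, 0.5, 0.3]
print(select_hard_keypoints(per_keypoint, top_m=2))  # [1, 2]
```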



3. Experiments


3.1. Data Process

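  • Extend each human box to a fixed 256:192 (height:width) aspect ratio, crop it, and resize the crop to 256x192
  • Augmentation: random flip, rotation (-40° to +40°), and scale (0.7 to 1.3)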
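
A minimal sketch of the box step, assuming the box is enlarged symmetrically around its center so that the crop can be resized to 256x192 without distortion. The function name and the (x, y, w, h) box layout are illustrative choices, not taken from the paper's code.

```python
def extend_to_aspect(x, y, w, h, target_h=256, target_w=192):
    """Grow a box (top-left x, y, width w, height h) around its center until
    height:width equals target_h:target_w. The box is only enlarged, never shrunk."""
    cx, cy = x + w / 2.0, y + h / 2.0
    ratio = target_h / target_w  # desired height / width
    if h / w > ratio:
        w = h / ratio  # box is too tall for the target ratio: widen it
    else:
        h = w * ratio  # box is too wide for the target ratio: make it taller
    return cx - w / 2.0, cy - h / 2.0, w, h

# A square 100x100 box keeps its width and grows in height to match 256:192.
x, y, w, h = extend_to_aspect(0, 0, 100, 100)
```

The extended box may reach outside the image, so a real pipeline would also need padding or clipping at the borders.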

3.2. Test

  • ensemble mechanism

3.3. Ablation Study

3.3.1. NMS strategy

soft-NMS surpasses hard-NMS.
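
For reference, the difference between the two strategies can be sketched as below. Hard NMS discards every box whose IoU with a higher-scoring box exceeds a threshold, whereas soft-NMS only decays its score. This sketch uses the Gaussian decay variant; `sigma=0.5` and `score_thresh=0.001` are assumed defaults, not values reported in these notes.

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: returns (index, decayed score) pairs in selection order."""
    remaining = list(range(len(boxes)))
    scores = list(scores)  # copy so the caller's scores are not modified
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break  # every other remaining box scores even lower
        kept.append((best, scores[best]))
        for i in remaining:
            # Decay instead of delete: heavier overlap means a stronger penalty
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this usage example the second box overlaps the first heavily, so it survives with a reduced score instead of being removed, which helps when people stand close together.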



3.3.2. Detector Performance

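Detection AP is less important for pose estimation: once the detector is reasonably accurate, further gains in detection AP bring only small gains in keypoint AP.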



3.3.3. Hard Keypoints Number

M = 8 works well.



3.3.4. With/Without



3.3.5. Concatenation



3.3.6. Dilation

Dilation increases both AP and FLOPs.



3.3.7. Image Size



3.4. Comparison